3/07/2020

Our project

We decided to focus on central Colombia, mainly because it is the region that contains the capital.

We built a model for the number of confirmed cases using all the other covariates (plus some we created), and we estimated the predictive accuracy of our selected model.

Loading the dataset

Our departments

We decided to consider as central Colombia the following departments:

  • Bogotá D.C.,
  • Boyacá,
  • Tolima,
  • Cundinamarca,
  • Meta,
  • Quindío,
  • Valle del Cauca,
  • Risaralda,
  • Caldas,
  • Antioquia,
  • Santander,
  • Casanare.

Map

Here we can see our selected departments. The color of each pin reflects the number of cases: fewer than \(10\) cases, green; fewer than \(100\), orange; otherwise red.
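The thresholds above can be sketched as a small helper (a hypothetical function of ours, not taken from the original code):

```r
# Map a case count to a pin color using the thresholds described above.
pin_color <- function(n_cases) {
  if (n_cases < 10) "green"
  else if (n_cases < 100) "orange"
  else "red"
}
```

For example, `pin_color(7)` returns `"green"`, while a count of exactly 10 already falls in the orange band.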

Preprocessing

##   ID de caso Fecha de diagnóstico Ciudad de ubicación Departamento o Distrito
## 1          1           2020-03-06              Bogotá             Bogotá D.C.
## 2          2           2020-03-09                Buga         Valle del Cauca
## 3          3           2020-03-09            Medellín               Antioquia
## 4          4           2020-03-11            Medellín               Antioquia
## 5          5           2020-03-11            Medellín               Antioquia
## 6          6           2020-03-11              Itagüí               Antioquia
##     Atención Edad Sexo        Tipo País de procedencia Grupo de edad
## 1 Recuperado   19    F   Importado              Italia         19-30
## 2 Recuperado   34    M   Importado              España         31-45
## 3 Recuperado   50    F   Importado              España         46-60
## 4 Recuperado   55    M Relacionado            Colombia         46-60
## 5 Recuperado   25    M Relacionado            Colombia         19-30
## 6       Casa   27    F Relacionado            Colombia         19-30
##   Continente de procedencia
## 1                    Europa
## 2                    Europa
## 3                    Europa
## 4                  Colombia
## 5                  Colombia
## 6                  Colombia
##          Date Elapsed time Department New cases/day Cumulative cases/Department
## 2  2020-03-09            3  Antioquia             1                           1
## 4  2020-03-11            5  Antioquia             3                           4
## 9  2020-03-14            8  Antioquia             3                           7
## 11 2020-03-15            9  Antioquia             1                           8
## 24 2020-03-19           13  Antioquia             3                          11
## 28 2020-03-20           14  Antioquia            11                          22
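The preprocessing step that turns the raw one-row-per-case table into the per-day, per-department table shown above can be sketched in base R as follows (variable and column names are ours; the original preprocessing code is not shown). Using the first Antioquia cases from the raw table, with 2020-03-06 as day zero:

```r
# One row per confirmed case (subset of the raw data shown above).
cases <- data.frame(
  date = as.Date(c("2020-03-09", "2020-03-11", "2020-03-11", "2020-03-11")),
  department = rep("Antioquia", 4)
)

# Collapse to one row per (date, department) with the daily count.
daily <- aggregate(list(new_cases = rep(1, nrow(cases))),
                   by = list(date = cases$date, department = cases$department),
                   FUN = sum)
daily <- daily[order(daily$department, daily$date), ]

# Days elapsed since the first confirmed case (2020-03-06),
# and the running total within each department.
daily$elapsed_time <- as.integer(daily$date - as.Date("2020-03-06"))
daily$cumulative <- ave(daily$new_cases, daily$department, FUN = cumsum)
```

This reproduces the first two Antioquia rows of the table above: 1 case on day 3 and 3 more on day 5, for a cumulative total of 4.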

Exploring the dataset

Number of cases confirmed day by day

The frequentist approach

Other slide

Something.

The Bayesian approach

Poisson regression

As a first attempt, we fit a simple Poisson regression:

\[ \ln\lambda_i = \alpha + \beta\cdot \text{elapsed\_time}_i \\ y_i \sim \text{Poisson}(\lambda_i) \] with \(i = 1,\dots,83\), where \(y_i\) is the number of new cases on day \(i\).
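A minimal Stan program consistent with this model might look as follows (the weakly informative priors and the `y_rep` generated-quantities block, used later for posterior checks, are our assumptions; the original Stan file is not shown):

```stan
data {
  int<lower=1> N;            // number of observations (83 here)
  vector[N] elapsed_time;    // days since the first confirmed case
  int<lower=0> y[N];         // new cases per day
}
parameters {
  real alpha;
  real beta;
}
model {
  // weakly informative priors (our assumption)
  alpha ~ normal(0, 10);
  beta ~ normal(0, 10);
  // poisson_log applies the log link: y ~ Poisson(exp(alpha + beta * t))
  y ~ poisson_log(alpha + beta * elapsed_time);
}
generated quantities {
  int y_rep[N];              // replicated data for posterior predictive checks
  for (n in 1:N)
    y_rep[n] = poisson_log_rng(alpha + beta * elapsed_time[n]);
}
```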

## SAMPLING FOR MODEL 'poisson_regression' NOW (CHAIN 1).
## Chain 1: 
## Chain 1: Gradient evaluation took 2.2e-05 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.22 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1: 
## Chain 1: 
## Chain 1: Iteration:    1 / 2000 [  0%]  (Warmup)
## Chain 1: Iteration:  200 / 2000 [ 10%]  (Warmup)
## Chain 1: Iteration:  400 / 2000 [ 20%]  (Warmup)
## Chain 1: Iteration:  600 / 2000 [ 30%]  (Warmup)
## Chain 1: Iteration:  800 / 2000 [ 40%]  (Warmup)
## Chain 1: Iteration: 1000 / 2000 [ 50%]  (Warmup)
## Chain 1: Iteration: 1001 / 2000 [ 50%]  (Sampling)
## Chain 1: Iteration: 1200 / 2000 [ 60%]  (Sampling)
## Chain 1: Iteration: 1400 / 2000 [ 70%]  (Sampling)
## Chain 1: Iteration: 1600 / 2000 [ 80%]  (Sampling)
## Chain 1: Iteration: 1800 / 2000 [ 90%]  (Sampling)
## Chain 1: Iteration: 2000 / 2000 [100%]  (Sampling)
## Chain 1: 
## Chain 1:  Elapsed Time: 0.166227 seconds (Warm-up)
## Chain 1:                0.118388 seconds (Sampling)
## Chain 1:                0.284615 seconds (Total)
## Chain 1: 
## 
## SAMPLING FOR MODEL 'poisson_regression' NOW (CHAIN 2).
## Chain 2: 
## Chain 2: Gradient evaluation took 1.4e-05 seconds
## Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.14 seconds.
## Chain 2: Adjust your expectations accordingly!
## Chain 2: 
## Chain 2: 
## Chain 2: Iteration:    1 / 2000 [  0%]  (Warmup)
## Chain 2: Iteration:  200 / 2000 [ 10%]  (Warmup)
## Chain 2: Iteration:  400 / 2000 [ 20%]  (Warmup)
## Chain 2: Iteration:  600 / 2000 [ 30%]  (Warmup)
## Chain 2: Iteration:  800 / 2000 [ 40%]  (Warmup)
## Chain 2: Iteration: 1000 / 2000 [ 50%]  (Warmup)
## Chain 2: Iteration: 1001 / 2000 [ 50%]  (Sampling)
## Chain 2: Iteration: 1200 / 2000 [ 60%]  (Sampling)
## Chain 2: Iteration: 1400 / 2000 [ 70%]  (Sampling)
## Chain 2: Iteration: 1600 / 2000 [ 80%]  (Sampling)
## Chain 2: Iteration: 1800 / 2000 [ 90%]  (Sampling)
## Chain 2: Iteration: 2000 / 2000 [100%]  (Sampling)
## Chain 2: 
## Chain 2:  Elapsed Time: 0.151639 seconds (Warm-up)
## Chain 2:                0.120805 seconds (Sampling)
## Chain 2:                0.272444 seconds (Total)
## Chain 2: 
## 
## SAMPLING FOR MODEL 'poisson_regression' NOW (CHAIN 3).
## Chain 3: 
## Chain 3: Gradient evaluation took 1.2e-05 seconds
## Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.12 seconds.
## Chain 3: Adjust your expectations accordingly!
## Chain 3: 
## Chain 3: 
## Chain 3: Iteration:    1 / 2000 [  0%]  (Warmup)
## Chain 3: Iteration:  200 / 2000 [ 10%]  (Warmup)
## Chain 3: Iteration:  400 / 2000 [ 20%]  (Warmup)
## Chain 3: Iteration:  600 / 2000 [ 30%]  (Warmup)
## Chain 3: Iteration:  800 / 2000 [ 40%]  (Warmup)
## Chain 3: Iteration: 1000 / 2000 [ 50%]  (Warmup)
## Chain 3: Iteration: 1001 / 2000 [ 50%]  (Sampling)
## Chain 3: Iteration: 1200 / 2000 [ 60%]  (Sampling)
## Chain 3: Iteration: 1400 / 2000 [ 70%]  (Sampling)
## Chain 3: Iteration: 1600 / 2000 [ 80%]  (Sampling)
## Chain 3: Iteration: 1800 / 2000 [ 90%]  (Sampling)
## Chain 3: Iteration: 2000 / 2000 [100%]  (Sampling)
## Chain 3: 
## Chain 3:  Elapsed Time: 0.154105 seconds (Warm-up)
## Chain 3:                0.112509 seconds (Sampling)
## Chain 3:                0.266614 seconds (Total)
## Chain 3: 
## 
## SAMPLING FOR MODEL 'poisson_regression' NOW (CHAIN 4).
## Chain 4: 
## Chain 4: Gradient evaluation took 1.2e-05 seconds
## Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0.12 seconds.
## Chain 4: Adjust your expectations accordingly!
## Chain 4: 
## Chain 4: 
## Chain 4: Iteration:    1 / 2000 [  0%]  (Warmup)
## Chain 4: Iteration:  200 / 2000 [ 10%]  (Warmup)
## Chain 4: Iteration:  400 / 2000 [ 20%]  (Warmup)
## Chain 4: Iteration:  600 / 2000 [ 30%]  (Warmup)
## Chain 4: Iteration:  800 / 2000 [ 40%]  (Warmup)
## Chain 4: Iteration: 1000 / 2000 [ 50%]  (Warmup)
## Chain 4: Iteration: 1001 / 2000 [ 50%]  (Sampling)
## Chain 4: Iteration: 1200 / 2000 [ 60%]  (Sampling)
## Chain 4: Iteration: 1400 / 2000 [ 70%]  (Sampling)
## Chain 4: Iteration: 1600 / 2000 [ 80%]  (Sampling)
## Chain 4: Iteration: 1800 / 2000 [ 90%]  (Sampling)
## Chain 4: Iteration: 2000 / 2000 [100%]  (Sampling)
## Chain 4: 
## Chain 4:  Elapsed Time: 0.166733 seconds (Warm-up)
## Chain 4:                0.1248 seconds (Sampling)
## Chain 4:                0.291533 seconds (Total)
## Chain 4:
## Inference for Stan model: poisson_regression.
## 4 chains, each with iter=2000; warmup=1000; thin=1; 
## post-warmup draws per chain=1000, total post-warmup draws=4000.
## 
##       mean se_mean   sd  2.5%  25%  50%  75% 97.5% n_eff Rhat
## alpha 0.17       0 0.13 -0.09 0.09 0.18 0.27  0.43   785 1.01
## beta  0.12       0 0.01  0.11 0.11 0.12 0.12  0.13   769 1.01
## 
## Samples were drawn using NUTS(diag_e) at Wed Jul  1 19:15:28 2020.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

Since Rhat is close to \(1\) for both parameters, we can conclude that the chains have converged.

theme_set(bayesplot::theme_default())

mcmc_scatter(as.matrix(fit.model.Poisson, pars=c("alpha", "beta") ), alpha=0.2)

Posterior predictive check:

y_rep <- as.matrix(fit.model.Poisson, pars="y_rep")
ppc_dens_overlay(y = model.data$cases, y_rep[1:200,]) 

The model is not able to capture low and high numbers of new cases.

The fit is not satisfactory; this is probably due to overdispersion. We can check the residuals to confirm this hypothesis:

# check the standardized residuals
mean_y_rep <- colMeans(y_rep)
std_residual <- (model.data$cases - mean_y_rep) / sqrt(mean_y_rep)
qplot(mean_y_rep, std_residual) + hline_at(2) + hline_at(-2)

The variance of the residuals increases as the predicted value increases. The standardized residuals should have mean 0 and standard deviation 1 (hence the lines at +2 and -2 indicate approximate 95% error bounds). Since their variance is much greater than 1, there is a large amount of overdispersion.

Classically, the problem of overdispersed count data is addressed by replacing the Poisson model with a negative binomial one.

Negative binomial model

In order to deal with the problem of overdispersed data, we can try to fit the following Negative Binomial model:

\[ \ln\lambda_i = \alpha + \beta\cdot \text{elapsed\_time}_i \\ y_i \sim \text{NegBinomial}(\lambda_i, \phi) \]

where the parameter \(\phi\), called the precision, is such that:

\[ E[y_i] = \lambda_i \\ Var[y_i] = \lambda_i + \frac{\lambda_i^2}{\phi} \]
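This mean–variance relationship can be verified by simulation; in R's `rnbinom`, the `size` argument plays the role of \(\phi\) when combined with `mu` (an illustrative check, with \(\lambda = 10\) and \(\phi = 2\)):

```r
# For the negative binomial with mean lambda and precision phi,
# the variance should be lambda + lambda^2 / phi.
set.seed(1)
lambda <- 10
phi <- 2
x <- rnbinom(1e5, mu = lambda, size = phi)

mean(x)  # close to lambda = 10
var(x)   # close to lambda + lambda^2 / phi = 60, well above the Poisson's 10
```

The extra \(\lambda^2/\phi\) term is exactly what lets the model absorb the overdispersion the Poisson fit could not.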